A short description of the post.
GAStech is a company produces gas production in Abila, Kronos. Many of its employees have company cars which are approved for both personal and business use. Those who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.
Employees with company cars are happy to have these vehicles, because the company cars are generally much higher quality than the cars they would be able to afford otherwise. However, GAStech does not trust their employees. Without the employees’ knowledge, GAStech has installed geospatial tracking software in the company vehicles. The vehicles are tracked periodically as long as they are moving.
Hence, this article is trying to identify which purchases made by which employees and identify suspicious patterns of behavior to make recommendations for further investigation.
Attention: Company cars can be used for personal and business use, but company trucks cannot be used for personal use
The same dataset have been used in 2014 VAST Challenge, which listed in Reference. Some groups using heatmap to show the relationship of location, date, time etc. City University London created theire own map by using Abila road network. Central South University used heatmap to show the transactions. All of these methods are worth to refer. Hence, in the report, we will mainly use heatmap and bipartite graph to show the relationship. And then try to show the suspectable route or behavior by tm_map.
car-assignments.csv: A list of vehicle assignments by employee, which including Employee Last Name, Employee First Name, Car ID, Current Employment Type and Current Employment Title
Geospatial folder: ESRI shapefiles of Abila and Kronos
gps.csv: A file of vehicle tracking data, which including Timestamp, Car ID, Latitude and Longitude
loyalty_data.csv: A file containing loyalty card transaction data, which including Timestamp, Location, Price and Loyalty Number
cc_data.csv: A file containing credit and debit card transaction data, which including Timestamp, Location, Price and Last 4 digits of the credit or debit card number
MC2-Tourist.jpg: A tourist map of Abila with locations of interest identified
packages = c('raster','sf','tmap','clock','tidyverse','lubridate','ggiraph',
'ggthemes','viridis','plotly','treemapify','igraph','ggpubr',
'readr','mapview')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
Abila_st <- st_read(dsn = "data/Geospatial", layer = "Abila")
Reading layer `Abila' from data source `C:\RY-Yan\R4VA\_posts\2021-07-15-vast-challenge-2021-mc1\data\Geospatial' using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
gps <- read_csv("data/gps.csv")
car <- read_csv("data/car-assignments.csv")
cd <- read_csv("data/cc_data.csv")
loyalty <- read_csv("data/loyalty_data.csv")
Attention: * Timestamp field is not in date-time format in Aspatial data file, credit and debit card transaction data file and loyalty card transaction data file. * id field in Aspatial data and car data files should be in factor data type. Last4CCNum should be in character data type
Hence, below codes are used to format the data:
gps$Timestamp <- date_time_parse(gps$Timestamp,
zone = "",
format = "%m/%d/%Y %H:%M")
cd$timestamp <- date_time_parse(cd$timestamp,
zone = "",
format = "%m/%d/%Y %H:%M")
loyalty$timestamp <- date_time_parse(loyalty$timestamp,
zone = "",
format = "%m/%d/%Y")
gps$id <- as_factor(gps$id)
car$CarID <- as_factor(car$CarID)
cd$last4ccnum <- as_factor(cd$last4ccnum)
When quickly go through the data files, we noticed that it’s neccessary for us to clean the data before analysis.
# In order to seperate the time into different time periods, we need to create a column just for hours in the credit card data
cd$hour <- as.numeric(format(cd$timestamp,"%H"))
period <- function(period){
sapply(period, function(x) if(x >= 12 & x < 18) "Afternoon (12noon~5.59pm)"
else if (x >= 6 & x < 12) "Morning (6am~11.59am)"
else if (x >= 18 & x < 20) "Evening (6pm~7.59pm)"
else if (x >= 20 & x < 24) "Night (8pm~11.59pm)"
else "Late Night (12mn~5.59am)"
)
}
cd$Time_perids <- period(cd$hour)
#Add new coloums for date, weekday and day of week
cd$date <- as.Date(cd$timestamp)
cd$weekday <- wday(cd$timestamp, label = TRUE)
cd$day <- day(cd$timestamp) %>% as_factor
loyalty$weekday <- wday(loyalty$timestamp, label = TRUE)
loyalty$day <- day(loyalty$timestamp) %>% as_factor
loyalty$popular_location <- paste(loyalty$timestamp,loyalty$location,loyalty$price)
#Use the same way as credit card data to format gps data
gps$date <- as_date(gps$Timestamp)
gps$hour <- hour(gps$Timestamp)
gps$Time_perids <- period(gps$hour)
gps$day <- day(gps$Timestamp) %>% as_factor()
gps$weekday <- wday(gps$Timestamp, label = TRUE)
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.
To gather the information of the most popular locations, we will be using the location and frequency data.
cd_locations <- unique(cd$location)
cdcount_location <- cd %>% group_by(location) %>%
summarize(count = n())
loy_locations <- unique(loyalty$location)
loycount_location <- loyalty %>% group_by(location) %>%
summarize(count = n())
cdplot <- ggplot(cdcount_location,
aes(x = count,
y = reorder(location,count),
fill = "pink",
stringr::str_wrap(cdcount_location$location, 15))) +
geom_col(color="grey", fill="light blue") +
#scale_fill_viridis(discrete = T,option = "H") +
xlab("Frequency") + ylab("Location") +
ggtitle("Popularity of each place (Credit)") +
theme(axis.text.x = element_text(face="bold", color="#000092",
size=8, angle=0),
axis.text.y = element_text(face="bold", color="#000092",
size=8, angle=0),
panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
loyplot <- ggplot(loycount_location,
aes(x = count,
y = reorder(location,count),
fill = "pink",
stringr::str_wrap(loycount_location$location, 15))) +
geom_col(color="grey", fill="light blue") +
#scale_fill_viridis(discrete = T,option = "H") +
xlab("Frequency") + ylab("Location") +
ggtitle("Popularity of each place (Loyalty)") +
theme(axis.text.x = element_text(face="bold", color="#000092",
size=8, angle=0),
axis.text.y = element_text(face="bold", color="#000092",
size=8, angle=0),
panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
ggarrange(cdplot,loyplot,ncol = 2)

From analysis of the locations, we noticed that for credit card and loyalty card data, the frequency of each places that employees visit are almost the same. “Katerina’s Cafe”, “Hippokampos”, “Guy’s Gyros”, “Hallowed Grounds” and “Brew’ve Been Served” are the TOP 5 popular places among employees.
cd_calendar <- cd %>% count(day, location) %>% as_factor()
cdcalendarmap <- ggplot(complete(cd_calendar, day, location), aes(x = day, y = location)) +
geom_tile(aes(fill = n), color = "black", size = 0.1) +
scale_fill_gradient(low = "light blue", high = "blue", na.value = "light grey") +
scale_y_discrete(expand = expansion(add = 1),
limits=rev) +
labs(title = "Heatmap of Visit Frequency",
subtitle = "(Credit card data)",
x = "Day of Month",
fill = "Frequency") +
theme_bw() +
theme(axis.ticks = element_blank(),
panel.border = element_blank(),
panel.spacing = unit(0.5, "cm"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
text = element_text(size=7),
axis.title.x = element_text(vjust=-5),
axis.title.y = element_blank(),
legend.position = "top")
cdcalendarmap

cd_calendar2 <- cd %>% count(hour, location) %>% as_factor()
cdcalendarmap2 <- ggplot(complete(cd_calendar2, hour, location), aes(x = hour, y = location)) +
scale_x_continuous(breaks = 0:24)+
geom_tile(aes(fill = n), color = "black", size = 0.1) +
scale_fill_gradient(low = "light blue", high = "blue", na.value = "white") +
scale_y_discrete(expand = expansion(add = 1),
limits=rev) +
labs(title = "Heatmap of Visit Frequency",
subtitle = "(Credit card data)",
x = "Hour",
fill = "Frequency") +
theme_bw() +
theme(axis.ticks = element_blank(),
panel.spacing = unit(0.5, "cm"),
panel.grid.major = element_blank(),
text = element_text(size=7),
axis.title.x = element_text(vjust=-5),
legend.position = "top")
cdcalendarmap2

For the places we listed above, Katerina’s Cafe and Hippokampos were two special locations since they were popular not only on weekdays but also at weekends while people only went to other three places on weekdays. In addition, for some coffee shops, the transaction only occur at special hours in credit card data, such as 7 o’clock for Hallowed Grounds, Coffe Cameleon and Brew’ve Been Served.
Some suspectable transactions occur in Kronos Mart and Daily Dealz at 3am and 6 am, respectively.
#First, we need to combine the credit card and loyalty card data
cd_loyalty <- cd %>%
inner_join(loyalty, by = c("date" = "timestamp",
"location" = "location",
"price" = "price"),
method = "osa",
max_dist = 1,
distance_col = "distance")
#Then we select the useful data
#Try to find if there are some associate transactions
cd_loyalty1 <- cd_loyalty %>%
group_by(last4ccnum,loyaltynum) %>%
count() %>%
ungroup()
#cdduplicates <- cd_loyalty1 %>%
# filter(n>3) %>%
# filter(cd_loyalty1$last4ccnum == #cd_loyalty1$last4ccnum[duplicated(cd_loyalty1$last4ccnum)])
#loyaltyduplicates <- cd_loyalty1 %>%
# filter(n>3) %>%
# filter(cd_loyalty1$loyaltynum == #cd_loyalty1$loyaltynum[duplicated(cd_loyalty1$loyaltynum)])
#One loyalty card links to two credit cards
loyalty_cd_duplicates <- subset(cd_loyalty1,loyaltynum == "L6267" | loyaltynum == "L3288")
#One credit card links to two loyalty cards
cd_loyalty_duplicates <- subset(cd_loyalty1,last4ccnum == "1286")
We inner join credit card and loyalty card data based on date, location and price. Generally, one credit card should only link to one loyalty card. However the results showed more than one distinct pair, which means the credit/loyalty card owner could be using more than one loyalty/credit cards. This showed all possible relationship between card owners. In addition, the location in inner join file didn’t contain Daily Dealz. This implied that transactions at this location likely done by either of the card only.
Then will use igraph to built a bipartite graph with nodes to show the relationships between credit card and loyalty card.
duplicates <- rbind(loyalty_cd_duplicates,cd_loyalty_duplicates)
g <- graph.data.frame(duplicates,directed = TRUE)
V(g)$type <- bipartite_mapping(g)$type
col <- c("sky blue", "orange")
shape <- c("circle", "square")
E(g)$color <- 'steelblue'
plot(g, layout = layout.bipartite,
vertex.color = col[as.numeric(V(g)$type)+1], vertex.size = 15, vertex.label.cex = 0.8,
vertex.shape = shape[as.numeric(V(g)$type)+1],
edge.label = E(g)$n, edge.label.cex = 0.8, edge.label.color = "black", legend = TRUE)

From the bipartite graph, L6267 was used by both credit card 6691 and 6899. We also think there should be some form of relationship between credit card 1286 and 9241 holders because of the high co-used of the loyalty card L3288.
For these issues, we recommend that using full join to join credit and loyalty card data files, removing the duplicates. Then mapping of credit cards to the loyalty card based on the transactions of loyalty card only.
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.
# Clean the data
# Create new column for full name to make records unique
car$fullname <- paste(car$FirstName,car$LastName, sep = " ")
gps1<-gps %>%
filter(gps$id!='101',
gps$id!='104',
gps$id!='105',
gps$id!='106',
gps$id!='107')
gps1<-left_join(gps1, car, by = c("id"="CarID"))
route <- ggplot() +
geom_point(data = gps1,
aes(x = long, y = lat,color = CurrentEmploymentType),
size = 0.8)+
facet_grid(day~CurrentEmploymentType) + theme_bw() +
ggtitle(label = "Movement Route by Current Employment Type",
subtitle = "(Company cars)") + theme(
plot.title = element_text(color = "black", size = 15),
plot.subtitle = element_text(size = 10)
)
route

car_data <- gps %>%
group_by(id,hour) %>%
summarise(n = n()) %>%
ungroup()
ggplot(car_data,aes(x = hour,y = id,fill = n)) + geom_tile()+
scale_fill_gradient(low = "light grey", high = "black")
cdcalendarmap2

Compare the transaction records generated by Car ID and location, we could easily find that the colored areas were not exactly the same, which means some people went out by car but did not spend money.
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.
First, we combined the gps and car data.
gps_car <- left_join(gps,car, by = c("id" = "CarID"))
# Sort first by ID in ascending order
gps_car <- gps_car[with(gps_car,order(id, Timestamp)),]
# Create a new column to compute the time time interval at which a car was recorded
gps_car <- gps_car %>%
mutate(Interval = Timestamp - lag(Timestamp, default = first(Timestamp)))
gps_car$Interval <- as.numeric(gps_car$Interval)
#Remove the observations were less than 15 minutes
gps_car <- gps_car %>%
filter(Interval > 900)
#Remove the duplicates
cdre <- cd_loyalty %>%
filter(!(last4ccnum == "1286" & loyaltynum == "L3288"
& !is.na(last4ccnum) & !is.na(loyaltynum))) %>%
filter(!(last4ccnum == "6899" & loyaltynum == "L6267"
& !is.na(last4ccnum) & !is.na(loyaltynum)))
#Inner join gps_car data and credit/loyalty card data to identify the card owner
cd_loyalty_gps <- gps_car %>%
inner_join(cdre, by = c("Timestamp" = "timestamp",
"Location" = "location"),
method = "osa",
max_dist = 1,
distance_col = "distance")
#Remove useless columns
cd_loyalty_gps1 <- cd_loyalty_gps[,-c(3:13,15:17,19:23,25:27)]
#Remove NA rows
cd_loyalty_gps1 <- cd_loyalty_gps1 %>%
filter(!(fullname == "NA"))
DT::datatable(cd_loyalty_gps1)
Although we could identify most of the card owners, some of them still had duplicates. For example, Ingrid Barranco, Lucas Alcazar, Loreto Bodrogi, Sven Flecha, Hennie Osvaldo, Varja Lagos, Minke Mies, Kanon Herrero and Marin Onda. Hence, we removed their records just to show the cards that we could identify their owners.
cd_loyalty_gps2 <- cd_loyalty_gps1 %>%
filter(!(fullname == "Ingrid Barranco")) %>%
filter(!(fullname == "Lucas Alcazar")) %>%
filter(!(fullname == "Loreto Bodrogi")) %>%
filter(!(fullname == "Sven Flecha")) %>%
filter(!(fullname == "Hennie Osvaldo")) %>%
filter(!(fullname == "Varja Lagos")) %>%
filter(!(fullname == "Minke Mies")) %>%
filter(!(fullname == "Kanon Herrero")) %>%
filter(!(fullname == "Marin Onda"))
DT::datatable(cd_loyalty_gps2)
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.
bgmap <- raster('data/Geospatial/MC2-tourist.tif')
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255)
gps_sf <- st_as_sf(gps,
coords = c("long","lat"),
crs = 4326)
gps_path <- gps_sf %>%
group_by(id, day,hour) %>%
summarize(m = mean(Timestamp),
do_union=FALSE) %>%
st_cast("LINESTRING")
p = npts(gps_path, by_feature = TRUE)
gps_path <- cbind(gps_path, p)
gps_path <- gps_path %>%
cbind(gps_path,p) %>%
filter(p > 1)
Seems that Lidelse and Birgitta often ate outside together.
Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why Please limit your response to 10 images and 500 words.
According to gps data, we known that trucks’ movement often occurs from 7am to 9pm, while the rush periods for company cars are 7am to 8am, 12am to 1pm and 5pm to 7pm. In addition, there were some suspectable movements caused by car ID 1, 15, 16, 21, 24 and 26 at 3am in Jan 7th ,9th ,11th and 14th.
car_data1 <- gps %>%
group_by(day,hour) %>%
summarise(n = n()) %>%
ungroup()
car5.1 <- expand.grid(day = unique(gps$day),
hour = c(1:23))
plot_car5.1 <- car_data1 %>%
right_join(
car5.1,
by = c('day','hour')
) %>%
mutate(hour = as.ordered(hour)) %>%
mutate(day = as.ordered(day)) %>%
replace_na(list(n = 0L))
ggplot(plot_car5.1, aes(hour,day,fill = n)) +
geom_tile(color = 'white',size = 0.1) +
scale_fill_gradient(low = "light grey", high = "black")
car_data <- gps %>%
group_by(id,hour) %>%
summarise(n = n()) %>%
ungroup()
ggplot(car_data,aes(x = hour,y = id,fill = n)) + geom_tile()+
scale_fill_gradient(low = "light grey", high = "black")

Hence, we tried to plot their route.
5.1.1 Car ID: 15
Bodrogi Loreto who has the car No.15 went back home at 5pm, Jan 6th. However, he drove to Speston Park at 3am, Jan 7th and then directly went to company in the morning.
Similarly, he went to Taxiarchan Park at 3am, Jan 9th.
gps_path15_2 <- gps_path %>%
filter(id == 15,day == 9 ,hour == 3)
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path15_2) + tm_lines()
5.1.2 Car ID: 24
Mies Minke, the owner of car No.24 who also went to same place, Taxiarchan Park, as Bodrogi Loreto. Hence, we could say they possibly met together.
gps_path24 <- gps_path %>%
filter(id == 24,day == 9,hour == 3)
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path24) + tm_lines()
5.1.3 Car ID 21 & 16, 21 & 24
Same as Bodrogi Loreto and Mies Minke, Vann Isia and Osvaldo Hennie met at Ahhaggo Museum at 3 am, Jan 11th.
Osvaldo Hennie and Mies Minke also met at Ahhaggo Museum at 3 am, Jan 14th.
Therefore, we could assume that some people from Site and Perimeter Control might have suspectable meeting.
5.2.1 Transaction with large amount
cdpriceplot <- ggplot(cd,
aes(x = price,
y = reorder(location,price),
fill = "pink"))+
geom_boxplot() +
xlab("Price") + ylab("Location") +
ggtitle("Transactions of each place (Credit)") +
theme(axis.text.x = element_text(face="bold", color="#000092",
size=8, angle=0),
axis.text.y = element_text(face="bold", color="#000092",
size=8, angle=0))
cdpriceplot

According to the box plot made by credit card data, Lucas Alcazar spent a large sum of money at Frydo’s Autosupply n’ More.
5.2.2 Transaction at midnight
cd1<- cd
cd1$time<-format(as.POSIXct(cd1$timestamp), format = "%H:%M:%S")
cd1<-cd1%>%
relocate(date,time,price,location,last4ccnum)
cd1 <- cd1[order(cd1$time,cd1$date), ]
cd1<- cd1%>%
filter(time>="01:00:00"&time<="9:00:00")
DT::datatable(cd1)
We made a data table for credit card transactions, and found that there were five suspectable transactions that occured around 3am on Jan 11th, Jan 12th and Jan 18th in Kronos Mart.
VAST Challenge 2021 VAST Challenge 2014-1 *VAST Challenge 2014-2